We are replicating the result from paper Predicting the Present with Google Trends.
The sales data: We first tried the link provided in the paper, but the data is adjusted. We were able to find the unadjusted data through this link. However, the values were different from the original data(used by the authors of the paper). The difference increases as year increases. The reason for that is still unclear. We will compare the two data set in the next session.
The search data: Trucks & SUVs Auto Insurance
We noticed that the sales from the data we obtained were always less than the data used in the paper. We want to find out why.
# load the data
sales <- read_excel("sales.xls")
merged <- read_csv("merged.csv")
# Period has type char in the data, convert that to yearmonth
ym <- as.yearmon(sales$Period, "%b-%Y")
# use as.Date() to convert the type of ym to date
sales$Period <- as.Date(ym)
#keep only data from 01/2004 through 07/2011 and renames the data frame as unadj, rename the column Value as sales,and add label "unadjusted"
unadj<- sales %>%
filter(Period <="2011-07-01") %>%
mutate(label = "unadjusted") %>%
rename(sales = Value)
# convert sales to numeric
unadj$sales <-as.numeric(unadj$sales)
# add label to original data, keep only Period, sales and label
orig <- merged %>%
mutate(label = "original") %>%
select(-insurance, -suvs)
# stack two data frame
com_data <- rbind(unadj, orig)
#plot log(sales) Vs Period
ggplot(com_data, aes(x=Period, y = log(sales), color = label)) +
geom_line()
# plot orignal against unadjusted
joined_data <- unadj %>% select(-label) %>% rename(unadjusted_sales = sales) %>% left_join(orig, by ="Period") %>% select(-label)
ggplot(joined_data, aes(x = sales, y = unadjusted_sales)) +
geom_point()+
geom_abline(linetype = "dashed") +
xlab('original') +
ylab('unadjusted')+
ggtitle("original sales against unadjusted sales")
From the first graph, we can see that the sales values from both original and unajusted are very close to each other(almost overlapping). When I plotted originaldata against unadjusted data, they almost lined up on the diagonal. Note that original is always less than unadjusted. I wonder if people went back and modified that data after the authors obtained the data. Further investigation is needed. In conclusion, the unadjusted data we found is very close to the original data.
The authors did not specify what kind of transformation/normalization/standarlization they performed on Google Trends data. Below is what we have tried to transform trends data. 1. new_value = (old_value-first_observed_value)/max(old_value) 2. Min-Max Normalization 3. Use scale()
None of these match with the original data. Therefore, we decided to process without any transformtion on Google Trends data.
We would like to compare the outputs of unadjusted data with the outputs from the original data.
#take the log of sales, take out the label column
unadj_rep <- unadj %>% mutate(sales=log(sales)) %>% select(-label)
# google_trends contains trends data and Period for join
google_trends <- merged %>% select(-sales)
# join trends to unadj_rep
unadj_with_trends <- unadj_rep %>% left_join(google_trends, by = "Period")
# replicate baseline model
model1 <- lm(data = unadj_with_trends, sales~lag(sales, 1)+lag(sales,12))
#the summary of the model
summary(model1)
##
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12), data = unadj_with_trends)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20921 -0.03504 0.00273 0.04081 0.22423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66776 0.75960 0.879 0.382126
## lag(sales, 1) 0.65099 0.07282 8.939 1.75e-13 ***
## lag(sales, 12) 0.28853 0.07225 3.994 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08015 on 76 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.7212, Adjusted R-squared: 0.7139
## F-statistic: 98.31 on 2 and 76 DF, p-value: < 2.2e-16
# replicate the model with trends
model_with_trend <- lm(data = unadj_with_trends, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
summary(model_with_trend)
##
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12) + suvs +
## insurance, data = unadj_with_trends)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.162083 -0.043363 0.004472 0.036573 0.164297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.45901 0.78723 -0.583 0.561620
## lag(sales, 1) 0.63000 0.06324 9.961 2.60e-15 ***
## lag(sales, 12) 0.41801 0.06544 6.387 1.33e-08 ***
## suvs 1.04187 0.16891 6.168 3.34e-08 ***
## insurance -0.52806 0.15401 -3.429 0.000994 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06593 on 74 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.8163, Adjusted R-squared: 0.8064
## F-statistic: 82.2 on 4 and 74 DF, p-value: < 2.2e-16
# replicate figure 2
# creating base
base_unajusted <- unadj_with_trends
for (i in 18:91){
merged_t <- unadj_with_trends[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
base_unajusted$sales[i] <- predict(model1,unadj_with_trends[1:i,])[i]
}
base_unajusted <- base_unajusted[18:91,]
# creating trends
trends_unajusted <- unadj_with_trends
for (i in 18:91){
merged_t <- unadj_with_trends[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
trends_unajusted$sales[i] <- predict(model1,unadj_with_trends[1:i,])[i]
}
trends_unajusted <- trends_unajusted[18:91,]
# label different data sets
actual_unajusted <- unadj_with_trends[18:91,]
actual_unajusted <- actual_unajusted %>%
mutate(label ="actual_unajusted")
base_unajusted <- base_unajusted %>%
mutate(label = "base_unajusted")
trends_unajusted <- trends_unajusted %>%
mutate(label ="trends_unajusted")
# R^2 for base
(cor(base_unajusted$sales,actual_unajusted$sales))^2
## [1] 0.7266
# R^2 for trends
(cor(trends_unajusted$sales,actual_unajusted$sales))^2
## [1] 0.7911028
# stack all data sets for plotting
plot_data <- rbind(actual, base, trends, actual_unajusted, base_unajusted, trends_unajusted)
ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black","black", "red","red","grey","grey"))+
scale_linetype_manual(values = c("solid","solid","dashed", "dashed", "solid","solid"))+
ylab('log(mvp)')+
xlab('Index')
From the graph, we see that lines are almost overlaped, and \(R^2\) are very close to the ones from original data.
# trend 1
# search data in category Trucks & SUVs between 01/01/2004 and 07/01/2011
suvs_trends <- read_csv("New_Trucks_suvs.csv")
names1<- names(suvs_trends)
suvs_trends <- suvs_trends %>%
rename(suvs = names1[2],
Period = Month)
# trend 2
# search data in category Auto Insurance between 01/01/2004 and 07/01/2011
insurance_trends <- read_csv("New_auto_insurance.csv")
names2 <- names(insurance_trends)
insurance_trends <- insurance_trends %>%
rename(insurance = names2[2],
Period = Month)
# census data
unadj_full <- sales %>% filter(Period <= "2011-07-01")
#join trends data
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m")))
# join all data
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value)
# take the log of sales
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12))
# summary of model1
tidy(model1)
m1 <- glance(model1)
# model with trends
model_with_trend <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) + insurance+ suvs)
# summary table of the model
# summary(model_with_trend)
tidy(model_with_trend)
mt<-glance(model_with_trend)
The \(R^2\) for base model is 0.7212228, and The \(R^2\) for base model is 0.7830012. We see an improvement in overall fitting. We will further investigate if this is true with out-of-sample forcasting.
base <- with_trends_full
# creating predicted base with rolling window nowcast
for (i in 18:91){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
base$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
base <- base[18:91,]
# creating predicted trends with rolling window nowcast
trends <- with_trends_full
for (i in 18:91){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
trends$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
trends <- trends[18:91,]
# Make the graph
actual <- with_trends_full[18:91,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
# means absolute error
MAE_base <- mean(abs(base$sales-actual$sales))
MAE_trends <- mean(abs(trends$sales-actual$sales))
# data for recession period Dec 2007 to June 2009
recession_trends <- trends %>%
filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_base <- base %>%
filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_actual <- actual %>%
filter(Period>="2007-12-01"& Period<="2009-06-01")
MAE_recession_trends <- mean(abs(recession_trends$sales-recession_actual$sales))
MAE_recession_base <- mean(abs(recession_base$sales-recession_actual$sales))
# Overall improvement
overall_improv <- (mean(abs(base$sales-actual$sales))-mean(abs(trends$sales-actual$sales)))/mean(abs(base$sales-actual$sales))
# recession improvement
recession_improv <- (mean(abs(recession_base$sales-recession_actual$sales))-mean(abs(recession_trends$sales-recession_actual$sales)))/mean(abs(recession_base$sales-recession_actual$sales))
#R^2 for base
r_sqr_base <-(cor(base$sales,actual$sales))^2
# R^2 for trends
r_sqr_trends <- (cor(trends$sales[3:73],actual$sales[3:73]))^2
#We can use MAE() from caret package to calculate MAE
#
# library(caret)
#
# # R native funcitons
# MAE(recession_trends$sales, recession_actual$sales)
# MAE(recession_base$sales, recession_actual$sales)
#
# MAE(trends$sales, actual$sales)
# MAE(base$sales, actual$sales)
The MAE for base model is 0.0636592, and the MAE for the model with trends is 0.0668723. The overall improvement is -5.0473494%. That means adding trends to the model make things worse. We see a totally different story when forces on recession period. The MAE for base model is 0.0890323, and the MAE for the model with trends is 0.0679496 during the recession. There is 23.6798011% improvement after adding trends to the model.
suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>%
rename(suvs = names1[2],
Period = Month)
insurance_trends <- insurance_trends %>%
rename(insurance = names2[2],
Period = Month)
unadj_full <- sales %>% filter(Period <= "2020-05-01")
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m")))
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value)
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12))
#the summary of the model
summary(model1)
##
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12), data = with_trends_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.280436 -0.038405 0.003295 0.040140 0.210174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03950 0.33363 -0.118 0.906
## lag(sales, 1) 0.54398 0.04544 11.970 <2e-16 ***
## lag(sales, 12) 0.46044 0.04742 9.709 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07389 on 182 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.8645, Adjusted R-squared: 0.863
## F-statistic: 580.5 on 2 and 182 DF, p-value: < 2.2e-16
# model with trends
model_with_trend <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
# summary table of the model
summary(model_with_trend)
##
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12) + suvs +
## insurance, data = with_trends_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.276030 -0.033915 0.000359 0.041266 0.191254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1787983 0.5066360 2.327 0.02109 *
## lag(sales, 1) 0.4873952 0.0477822 10.200 < 2e-16 ***
## lag(sales, 12) 0.4001740 0.0507214 7.890 2.84e-13 ***
## suvs 0.0025711 0.0008165 3.149 0.00192 **
## insurance -0.0016652 0.0006095 -2.732 0.00692 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07222 on 180 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.872, Adjusted R-squared: 0.8691
## F-statistic: 306.5 on 4 and 180 DF, p-value: < 2.2e-16
# Rolling window
base <- with_trends_full
len <- nrow(base)
# creating predicted base with rolling window forecast
for (i in 18:len){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
base$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
base <- base[18:len,]
# creating predicted trends with rolling window forecast
trends <- with_trends_full
for (i in 18:len){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
trends$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
trends <- trends[18:len,]
# Make the graph
actual <- with_trends_full[18:len,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
Add commends here
We want forcast by using the model \(y_t+2=b_1y_{t}+b_2y_{t-12} + e_t\). We’ll use rolling window forcast to evaluate our model. At the end, We will try to predict the sales of July, 2020.
commend on yearly trends
## getting data ready
suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>%
rename(suvs = names1[2],
Period = Month)
insurance_trends <- insurance_trends %>%
rename(insurance = names2[2],
Period = Month)
unadj_full <- sales %>% filter(Period <= "2020-05-01")
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m")))
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value)
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
#with_trends_full$sales = as.numeric(with_trends_full$sales)
## starting from 2018
whole_data_2018 <- with_trends_full %>%
filter(Period >="2015-01-01") # we want to predict 2018-01-01 when it is 2017-10-01, we need data from 2016-10-01
#months <- nrow(whole_data_2018)
months <- 60
base <- whole_data_2018
start <- 37
# creating predicted base with rolling window nowcast
for (i in start:months){
merged_t <- whole_data_2018[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 2)+lag(sales,14))
base$sales[i] <- predict(model1,whole_data_2018[1:i,])[i]
}
base <- base[start:months,]
# creating predicted trends with rolling window nowcast
trends <- whole_data_2018
for (i in start:months){
merged_t <- whole_data_2018[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 2)+lag(sales,14)+ suvs + insurance)
trends$sales[i] <- predict(model1,whole_data_2018[1:i,])[i]
}
trends <- trends[start:months,]
# Make the graph
actual <- whole_data_2018[start:months,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
MAE(trends$sales, actual$sales)
## [1] 0.05632414
MAE(base$sales, actual$sales)
## [1] 0.07088563
(MAE(base$sales, actual$sales)-MAE(trends$sales, actual$sales))/MAE(base$sales, actual$sales)
## [1] 0.2054223
#R^2 for base
(cor(base$sales,actual$sales))^2
## [1] 0.004935371
# R^2 for trends
(cor(trends$sales,actual$sales))^2
## [1] 0.1636343
The model did very bad. observed yearly trends
# yearly trends
ggplotly(ggplot(with_trends_full, aes(x = Period, y = sales)) + geom_line())
with model \(y_t = b_1y_{t-1}+b12y_{t-12}+e_t\)(AR-1 model) much better performance
## getting data ready
suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>%
rename(suvs = names1[2],
Period = Month)
insurance_trends <- insurance_trends %>%
rename(insurance = names2[2],
Period = Month)
unadj_full <- sales %>% filter(Period <= "2020-05-01")
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m")))
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value)
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
#with_trends_full$sales = as.numeric(with_trends_full$sales)
ggplotly(ggplot(with_trends_full, aes(x = Period, y = sales)) + geom_line())
## starting from 2018
whole_data_2018 <- with_trends_full %>%
filter(Period >="2015-01-01") # we want to predict 2018-01-01 when it is 2017-10-01, we need data from 2016-10-01
#months <- nrow(whole_data_2018)
months <- 60
base <- whole_data_2018
start <- 37
# creating predicted base with rolling window nowcast
for (i in start:months){
merged_t <- whole_data_2018[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
base$sales[i] <- predict(model1,whole_data_2018[1:i,])[i]
}
base <- base[start:months,]
# creating predicted trends with rolling window nowcast
trends <- whole_data_2018
for (i in start:months){
merged_t <- whole_data_2018[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
trends$sales[i] <- predict(model1,whole_data_2018[1:i,])[i]
}
trends <- trends[start:months,]
# Make the graph
actual <- whole_data_2018[start:months,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
MAE(trends$sales, actual$sales)
## [1] 0.02392399
MAE(base$sales, actual$sales)
## [1] 0.02278401
(MAE(base$sales, actual$sales)-MAE(trends$sales, actual$sales))/MAE(base$sales, actual$sales)
## [1] -0.0500341
#R^2 for base
(cor(base$sales,actual$sales))^2
## [1] 0.8951386
# R^2 for trends
(cor(trends$sales,actual$sales))^2
## [1] 0.8936784
data_for_pre_20 <- with_trends_full %>%
filter(Period >="2017-10-01"& Period <="2019-12-01")
model1 <- lm(data = data_for_pre_20 , sales~lag(sales, 1)+lag(sales,12))
data_2020 <- with_trends_full %>% filter(Period >="2019-01-01")
summary(model1)
##
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12), data = data_for_pre_20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.041858 -0.008167 -0.002117 0.010040 0.035588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.61815 1.14542 -2.286 0.0412 *
## lag(sales, 1) 0.15053 0.07233 2.081 0.0595 .
## lag(sales, 12) 1.07968 0.08664 12.462 3.17e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02108 on 12 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9377, Adjusted R-squared: 0.9273
## F-statistic: 90.28 on 2 and 12 DF, p-value: 5.858e-08
predicted <- predict(model1,data_2020)[13:17]
MSE <- mean((predicted-data_2020$sales[13:17])^2)
mse(model1, data_2020)
## [1] 0.0540795
#We will predict June 2020 and then July 2020
n <- tidy(model1)
June <- exp(n$estimate[1] +n$estimate[2]*data_2020$sales[17]+n$estimate[3]*data_2020$sales[6])
June
## [1] 109322.3
July <- exp(n$estimate[1] +n$estimate[2]*log(June)+n$estimate[3]*data_2020$sales[7])
July
## [1] 115180.4
data_for_pre_20 <- with_trends_full %>%
filter(Period >="2017-10-01"& Period <="2019-12-01")
model_with_trend <- lm(data = data_for_pre_20 , sales~lag(sales, 1)+lag(sales,12) + insurance + suvs)
data_2020 <- with_trends_full %>% filter(Period >="2019-01-01")
summary(model_with_trend )
##
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12) + insurance +
## suvs, data = data_for_pre_20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.041737 -0.008079 -0.001873 0.010327 0.035907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.6407681 1.4175371 -1.863 0.0921 .
## lag(sales, 1) 0.1501538 0.0821828 1.827 0.0976 .
## lag(sales, 12) 1.0824753 0.1104340 9.802 1.91e-06 ***
## insurance 0.0001865 0.0032765 0.057 0.9557
## suvs -0.0001653 0.0023871 -0.069 0.9462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02308 on 10 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9377, Adjusted R-squared: 0.9128
## F-statistic: 37.64 on 4 and 10 DF, p-value: 5.334e-06
predicted <- predict(model_with_trend ,data_2020)[13:17]
MSE <- mean((predicted-data_2020$sales[13:17])^2)
mse(model_with_trend , data_2020)
## [1] 0.05457793
#We will predict July 2020
n <- tidy(model_with_trend )
June <- exp(n$estimate[1] +n$estimate[2]*data_2020$sales[17]+n$estimate[3]*data_2020$sales[6]+n$estimate[4]*data_2020$insurance[17]+n$estimate[5]*data_2020$suvs[17])
June
## [1] 109221.7
data_for_pre_20 <- with_trends_full %>%
filter(Period >="2017-10-01"& Period <="2019-12-01")
model_with_trend <- lm(data = data_for_pre_20 , sales~lag(sales, 2)+lag(sales,14) + insurance + suvs)
data_2020 <- with_trends_full %>% filter(Period >="2019-01-01")
summary(model_with_trend )
##
## Call:
## lm(formula = sales ~ lag(sales, 2) + lag(sales, 14) + insurance +
## suvs, data = data_for_pre_20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.143274 -0.022322 -0.003005 0.036227 0.083648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262405 4.386827 1.200 0.2646
## lag(sales, 2) 0.405251 1.185718 0.342 0.7413
## lag(sales, 14) 0.032898 1.421701 0.023 0.9821
## insurance -0.011093 0.012547 -0.884 0.4024
## suvs 0.019694 0.009837 2.002 0.0803 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0743 on 8 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.431, Adjusted R-squared: 0.1465
## F-statistic: 1.515 on 4 and 8 DF, p-value: 0.2855
predicted <- predict(model_with_trend ,data_2020)[13:17]
MSE <- mean((predicted-data_2020$sales[13:17])^2)
mse(model_with_trend , data_2020)
## [1] 0.006739459
#We will predict July 2020
n <- tidy(model_with_trend )
July <- exp(n$estimate[1] +n$estimate[2]*data_2020$sales[17]+n$estimate[3]*data_2020$sales[5]+n$estimate[4]*data_2020$insurance[17]+n$estimate[5]*data_2020$suvs[17])
July-115180.4
## [1] -7044.656
data_for_pre_20 <- with_trends_full %>%
filter(Period >="2010-01-01"& Period <="2016-12-01")
# model_with_trend <- lm(data = data_for_pre_20 , sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)+ insurance + suvs)
model_with_trend <- lm(data = data_for_pre_20 , sales~lag(sales, 12)+lag(sales, 48)+ insurance + suvs)
data_2020 <- with_trends_full %>% filter(Period >="2013-01-01")
summary(model_with_trend )
##
## Call:
## lm(formula = sales ~ lag(sales, 12) + lag(sales, 48) + insurance +
## suvs, data = data_for_pre_20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.045072 -0.014807 0.000768 0.009903 0.041889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4942178 0.4559826 1.084 0.2868
## lag(sales, 12) 0.5807464 0.0854021 6.800 1.29e-07 ***
## lag(sales, 48) 0.3957707 0.0801927 4.935 2.59e-05 ***
## insurance 0.0016225 0.0015207 1.067 0.2942
## suvs -0.0021911 0.0009134 -2.399 0.0226 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01921 on 31 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9543
## F-statistic: 183.9 on 4 and 31 DF, p-value: < 2.2e-16
predicted <- predict(model_with_trend ,data_2020)[49:89]
MSE <- mean((predicted-data_2020$sales[49:89])^2)
mse(model_with_trend , data_2020)
## [1] 0.01147997
#We will predict July 2020
n <- tidy(model_with_trend )
July <- exp(n$estimate[1] +n$estimate[2]*data_2020$sales[17]+n$estimate[3]*data_2020$sales[5]+n$estimate[4]*data_2020$insurance[17]+n$estimate[5]*data_2020$suvs[17])
July-115180.4
## [1] -8875.708